Separate Training




Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation

Zheng, Tong, Wen, Yan, Bao, Huiwen, Guo, Junfeng, Huang, Heng

arXiv.org Artificial Intelligence

The emergence of Large Language Models (LLMs) has advanced multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue by scaling up the training and computation budget, which raises a critical question: is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze linguistic conflict and synergy, the underlying mechanisms of CoM, during the post-training phase. We identify an asymmetric phenomenon: whether conflict or synergy dominates varies across translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these insights, we propose a direction-aware training approach, combined with group-wise model merging, to explicitly address the asymmetry in linguistic conflict and synergy. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain, a model trained only with multilingual pre-training, and achieves performance comparable to X-ALMA-13B (SFT only) while using only 20B pretraining tokens and 17B parameters (5.5x fewer pretraining tokens and a 1.7x smaller model), with just a 0.85 COMET drop on the Flores-200 test sets of 50 languages.
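Group-wise model merging, as described here, can be as simple as averaging the weights of direction-specific checkpoints within a language group. The sketch below shows plain uniform weight averaging in PyTorch; the group definitions, checkpoint names, and uniform weighting are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of group-wise model merging: models fine-tuned on
# different translation directions within a language group are merged
# by uniform weight averaging. Uniform weights and the example group
# are assumptions for illustration.
from collections import OrderedDict
import torch

def merge_state_dicts(state_dicts):
    """Uniformly average a list of compatible model state dicts."""
    merged = OrderedDict()
    for key in state_dicts[0]:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

# Hypothetical usage: one checkpoint per translation direction in a group.
# group_ckpts = [torch.load(f"ckpt_{d}.pt") for d in ("en-de", "en-fr")]
# model.load_state_dict(merge_state_dicts(group_ckpts))
```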


Slicing Vision Transformer for Flexible Inference

Zhang, Yitian, Coskun, Huseyin, Ma, Xu, Wang, Huan, Ma, Ke, Chen, Xi, Hu, Derek Hao, Fu, Yun

arXiv.org Artificial Intelligence

Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit in an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically sub-networks of a larger ViT at different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViT to vary in width. Concretely, Scala activates several subnets during training, introduces Isolated Activation to disentangle the smallest sub-network from the other subnets, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validation on different tasks demonstrates that with only one-shot training, Scala learns slimmable representations without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters. Code is available here.
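The core idea, that a narrower sub-network reuses the leading channels of a wider layer's weights, can be illustrated with a width-slicable linear layer. This is a minimal sketch under that assumption, not Scala's implementation: the layer, the width ratio, and the slicing convention are illustrative.

```python
# Minimal sketch of width slicing: smaller "sub-networks" reuse the
# leading channels of a larger linear layer's weight matrix. The layer
# and sampling scheme are illustrative, not the paper's code.
import torch
import torch.nn as nn

class SlimmableLinear(nn.Linear):
    """Linear layer whose active output width can be scaled at run time."""
    def forward(self, x, width_ratio=1.0):
        out_f = max(1, int(self.out_features * width_ratio))
        in_f = x.shape[-1]  # input may already be sliced upstream
        weight = self.weight[:out_f, :in_f]
        bias = self.bias[:out_f] if self.bias is not None else None
        return nn.functional.linear(x, weight, bias)

layer = SlimmableLinear(768, 768)
x = torch.randn(4, 768)
full = layer(x)                   # full-width sub-network
half = layer(x, width_ratio=0.5)  # 50%-width sub-network, shared weights
print(full.shape, half.shape)     # torch.Size([4, 768]) torch.Size([4, 384])
```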


Employing Layerwised Unsupervised Learning to Lessen Data and Loss Requirements in Forward-Forward Algorithms

Hwang, Taewook, Seo, Hyein, Jung, Sangkeun

arXiv.org Artificial Intelligence

Recent deep learning models such as ChatGPT, trained with the back-propagation algorithm, have exhibited remarkable performance. However, the disparity between biological brain processes and the back-propagation algorithm has been noted. The Forward-Forward algorithm, which trains deep learning models solely through forward passes, has emerged to address this. Although the Forward-Forward algorithm cannot replace back-propagation due to limitations such as the need for special inputs and loss functions, it has the potential to be useful in situations where back-propagation is difficult to apply. To work around this limitation and verify usability, we propose an Unsupervised Forward-Forward algorithm. Using an unsupervised learning model enables training with ordinary loss functions and unrestricted inputs. This approach leads to stable learning and versatile use across various datasets and tasks. From a usability perspective, given the characteristics of the Forward-Forward algorithm and the advantages of the proposed method, we anticipate practical application even in scenarios such as federated learning, where deep learning layers need to be trained separately in physically distributed environments.
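For readers unfamiliar with the baseline this builds on, the sketch below shows standard layer-wise Forward-Forward training with the goodness objective: each layer optimizes a purely local loss, and activations are detached between layers so no gradient ever crosses a layer boundary. The network sizes and data are placeholders, and the paper's unsupervised variant (which removes the special positive/negative inputs) is not reproduced here.

```python
# Minimal sketch of layer-wise Forward-Forward training. Each layer is
# updated with its own local objective; inputs to the next layer are
# detached, so the learning signal never back-propagates across layers.
import torch
import torch.nn as nn

layers = [nn.Linear(784, 256), nn.Linear(256, 256)]
opts = [torch.optim.SGD(l.parameters(), lr=0.03) for l in layers]

def goodness(h):
    return h.pow(2).sum(dim=1)  # per-sample "goodness" of activations

x_pos = torch.randn(32, 784)  # placeholder positive/negative data
x_neg = torch.randn(32, 784)

for layer, opt in zip(layers, opts):
    h_pos, h_neg = torch.relu(layer(x_pos)), torch.relu(layer(x_neg))
    # push positive goodness up, negative goodness down (local loss only)
    loss = torch.log1p(torch.exp(-(goodness(h_pos) - goodness(h_neg)))).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # detach so the next layer trains locally, with no cross-layer gradient
    x_pos, x_neg = h_pos.detach(), h_neg.detach()
```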


Austin city agency offers racially segregated 'anti-racist' trainings for 'white folks' and 'people of color'

FOX News

Fox News host Greg Gutfeld goes over this week's leftovers, and Gutfeld! reacts to the resurfacing of an old DEI training video by former Navy DEI director Dr. Charles "Chuck" Barber. A city agency in Austin, Texas invited employees to racially segregated "anti-racist" meetings where "white folks" were asked not to attend a meeting that was only for "people of color." A January email obtained by Fox News Digital reveals that the equity and inclusion coordinator of Austin's Parks & Recreation Department invited employees to attend "Antiracist Affinity Spaces," consisting of two separate trainings segregated by race as part of an "Equity and Inclusion program." "For People of Color*: Once a month, PARD employees of color will meet up at various city sites," the email says. "The first 1.5 hours will be for fostering dialogue and the last 30 minutes will be for networking. This monthly space will offer folks the opportunities to gather and connect with other PARD employees of color, share about our personal and professional experiences with racism, and learn about mentoring and job opportunities for professional development."


High Dimensional Causal Inference with Variational Backdoor Adjustment

Israel, Daniel, Grover, Aditya, Broeck, Guy Van den

arXiv.org Machine Learning

Backdoor adjustment is a technique in causal inference for estimating interventional quantities from purely observational data. For example, in medical settings, backdoor adjustment can be used to control for confounding and estimate the effectiveness of a treatment. However, high dimensional treatments and confounders pose a series of potential pitfalls in tractability, identifiability, and optimization. In this work, we take a generative modeling approach to backdoor adjustment for high dimensional treatments and confounders. We cast backdoor adjustment as an optimization problem in variational inference without relying on proxy variables or hidden confounders. Empirically, our method is able to estimate interventional likelihoods in a variety of high dimensional settings, including semi-synthetic X-ray medical data. To the best of our knowledge, this is the first application of backdoor adjustment in which all the relevant variables are high dimensional.
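For context, the classical backdoor adjustment identity the abstract builds on, with treatment t, confounder z, and outcome y, is shown below together with the standard Monte Carlo estimator used when z can be sampled from a model of p(z). The variational machinery the paper adds for high dimensional z is not reproduced here.

```latex
p(y \mid \mathrm{do}(t))
  = \int p(y \mid t, z)\, p(z)\, \mathrm{d}z
  \approx \frac{1}{N} \sum_{i=1}^{N} p(y \mid t, z_i),
  \qquad z_i \sim p(z)
```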


Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization

Zhu, Zhemin, Hiemstra, Djoerd, Apers, Peter, Wombacher, Andreas

arXiv.org Artificial Intelligence

The standard training method for Conditional Random Fields (CRFs) is very slow for large-scale applications. As an alternative, piecewise training divides the full graph into pieces, trains them independently, and combines the learned weights at test time. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method that, in contrast to MEMMs, is unaffected by the label bias problem. Experiments show that separate training (i) is indeed unaffected by the label bias problem; (ii) reduces training time from weeks to seconds; and (iii) obtains results competitive with standard and piecewise training on linear-chain CRFs.
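The co-occurrence rate at the heart of CR-F measures how much two adjacent labels co-occur beyond what independence predicts: CR(a, b) = p(a, b) / (p(a) p(b)). The sketch below estimates empirical co-occurrence rates from raw counts; in separate training, local factors like these are estimated independently and only combined at test time. Plain counting over toy label sequences is an illustrative stand-in for the paper's conditional models.

```python
# Minimal sketch of the co-occurrence rate: CR(a, b) = p(a, b) / (p(a) p(b)).
# Each local factor is estimated independently, mirroring the "separate
# training" idea of combining locally trained factors only at test time.
from collections import Counter
from itertools import pairwise  # Python 3.10+

sequences = [["B", "I", "O"], ["B", "O", "O"], ["B", "I", "I"]]

uni = Counter(tag for seq in sequences for tag in seq)
bi = Counter(p for seq in sequences for p in pairwise(seq))
n_uni = sum(uni.values())
n_bi = sum(bi.values())

def cr(a, b):
    """Empirical co-occurrence rate of adjacent tags a, b."""
    p_ab = bi[(a, b)] / n_bi
    return p_ab / ((uni[a] / n_uni) * (uni[b] / n_uni))

print(cr("B", "I"))  # > 1: B and I co-occur more than independence predicts
```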